AITopics

Technology: Information Technology > Artificial Intelligence (1.00)

Neural Information Processing SystemsFeb-17-2026, 12:23:55 GMT

dd03f856fc7f2efeec8b1c796284561d-Paper-Conference.pdf

Short-horizon tasks require few steps: e.g., go to a tree and chop it down, dig a hole.

large language model, machine learning, reinforcement learning, (23 more...)

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(7 more...)

Genre: Research Report (0.93)

Industry: Leisure & Entertainment > Games (0.70)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(4 more...)

Neural Information Processing SystemsFeb-16-2026, 10:56:44 GMT

aa933b5abc1be30baece1d230ec575a7-Paper-Conference.pdf

large language model, machine learning, natural language, (18 more...)

Country:

North America > United States > Michigan (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology (0.67)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Neural Information Processing SystemsFeb-9-2026, 14:16:46 GMT

24c523085d10743633f9964e0623dbe0-Paper-Conference.pdf

edited image, machine learning, natural language, (15 more...)

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
(2 more...)

arXiv.org Artificial IntelligenceNov-18-2025

See it. Say it. Sorted: Agentic System for Compositional Diagram Generation

Zhang, Hantao, Liu, Jingyang, Li, Ed

We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.

large language model, machine learning, natural language, (21 more...)

2508.15222

Genre:

Workflow (0.70)
Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.59)

arXiv.org Artificial IntelligenceNov-5-2025

Light Future: Multimodal Action Frame Prediction via InstructPix2Pix

Zhong, Zesen, Zhang, Duomin, Li, Yijia

Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.

artificial intelligence, deep learning, machine learning, (16 more...)

2507.14809

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceNov-5-2025

LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency

Liu, Fangbing, Duan, Pengfei, Li, Wen, He, Yi

Recent advancements have demonstrated the great potential of flow matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art works like BAGEL face limitations, including detail degradation, content inconsistency, and inefficiency due to their reliance on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the BAGEL pre-trained model via curriculum learning, we significantly reduce inference steps, improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves 3x -- 5x speedup for lightweight editing and 2x for universal editing, requiring only 40% -- 50% of the inference time of BAGEL or Flux. These results demonstrate LGCC's ability to preserve detail, maintain contextual integrity, and enhance inference speed, offering a cost-efficient solution without compromising editing quality.

large language model, machine learning, natural language, (15 more...)

2511.01894

Genre: Research Report > New Finding (0.48)

Industry: Media > Photography (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.90)
(2 more...)

arXiv.org Artificial IntelligenceNov-4-2025

SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

Liu, Chaoqun, Aljunied, Mahani, Chen, Guizhen, Chan, Hou Pong, Xu, Weiwen, Rong, Yu, Zhang, Wenxuan

We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages-Indonesian (id), Thai (th), and Vietnamese (vi)-alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

artificial intelligence, large language model, natural language, (17 more...)

2511.0167

Country:

Asia > Southeast Asia (0.81)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)

arXiv.org Artificial IntelligenceOct-21-2025

Learning to play: A Multimodal Agent for 3D Game-Play

Yue, Yuguang, Salia, Irakli, Hunt, Samuel, Green, Christopher, Shi, Wenzhe, Hunt, Jonathan J

We argue that 3-D first-person video games are a challenging environment for real-time multi-modal reasoning. We first describe our dataset of human game-play, collected across a large variety of 3-D first-person games, which is both substantially larger and more diverse compared to prior publicly disclosed datasets, and contains text instructions. We demonstrate that we can learn an inverse dynamics model from this dataset, which allows us to impute actions on a much larger dataset of publicly available videos of human game play that lack recorded actions. We then train a text-conditioned agent for game playing using behavior cloning, with a custom architecture capable of realtime inference on a consumer GPU. We show the resulting model is capable of playing a variety of 3-D games and responding to text input. Finally, we outline some of the remaining challenges such as long-horizon tasks and quantitative evaluation across a large set of games.

arxiv preprint arxiv, large language model, machine learning, (19 more...)